ABSTRACT



Information Theory is applicable to a number of fields 



derived for the discrete case using Bayes's rule for the probability of causes. Various properties of this function are derived and discussed.
1. Introduction



In 1948 Claude Shannon published his now famous paper entitled "A Mathematical Theory of Communication" [11], later reprinted with a paper by Warren Weaver as reference [12], in which he defined a communication system as being composed of an information source, a transmitter, a channel, a receiver, and the destination. The fundamental problem in such a system is the reproduction at the destination, either exactly or approximately, of a message selected at the information source. His approach was statistical in nature; that is, he did not consider the semantic aspects of the message but rather that the message is one of a set of possible messages. Among other things he showed that under certain conditions it is possible to encode the transmitted information so that it would be received with an arbitrarily small frequency of error.

Basic to his arguments is a concept known in thermodynamics as entropy, which can be described as a measure of the amount of disorder in a physical system. This implies that a message, in some sense, represents a certain amount of disorder and that there is a measure of this disorder which can be obtained and used.

Kullback [7] recently published a book in which he uses the same basic concept to provide a unifying background for the study of the testing of statistical hypotheses. Bagno [1] uses Shannon's theorems to arrive at some startling conclusions relative to economic theory. Miller's [^] article presents a short discussion on the use of these concepts in the field of psychology. Thus Information Theory apparently has a wide field of applications.

What is information? One dictionary¹ lists among its several definitions "... knowledge communicated or received concerning some fact ...". We say that knowledge is certainty and that Information Theory is the study of the ability of systems to transmit certainty or, equivalently, of the change in an observer's state of uncertainty when he has performed experiments on a situation and drawn conclusions (gained knowledge) from the results.

In the following pages we will develop the basic statistic of Information Theory and show that it is a logical and appealing measure of information in the above sense. The basis of the development will be Probability Theory and specifically an application of Bayes's rule.

¹ The New Century Dictionary, Appleton-Century-Crofts, 1948.
2. Bayes's Rule and Information

Let us perform a simple experiment and see if there is a pattern or characteristic which can be exploited to obtain a statistic which relates a situation prior to one or more observations to the situation after the observations.

We can represent the situation by $A$, where $A$ is composed of $m$ mutually exclusive and exhaustive events $a_1, a_2, \ldots, a_m$, and the outcomes of an observation by $B$, where $B$ is composed of $n$ mutually exclusive and exhaustive events $b_1, b_2, \ldots, b_n$. An event $a_i$ occurs with probability $p(a_i)$, $0 < p(a_i) < 1$.



Consider that we have been given two nickels, told that one is fair and the other biased so that the probability of heads appearing when it is tossed is 1/4, and asked to determine which is the fair nickel. Let the experiment consist of choosing one of the two nickels at random, tossing it twice and noting the outcome of both tosses. Based on the results of this experiment we are to state the probability that the chosen nickel is fair.

Let $a_1$ denote the event: the fair nickel is chosen, and $a_2$ denote the event: the biased nickel is chosen. Let $b_1$ denote the event: the nickel comes up heads, and $b_2$ denote the event: the nickel comes up tails.
Since we make a random choice, the probability that the fair nickel is chosen, $p(a_1)$, is equal to .5, and the probability that the biased nickel is chosen, $p(a_2)$, is also equal to .5.

The conditional probability of a specific outcome will be denoted by $p(b_j/a_i)$, where

$p(b_1/a_1) = 1/2$,

$p(b_2/a_1) = 1/2$,

$p(b_1/a_2) = 1/4$,

and

$p(b_2/a_2) = 3/4$.

Considering now the first toss of the coin, from the definition of conditional probability² we can write

$$p(a_i b_j) = p(b_j/a_i)\,p(a_i) \tag{1}$$

where $a_i b_j$ denotes the joint occurrence of event $a_i$ and outcome $b_j$. When the $a_i$ are mutually exclusive and exhaustive events,

$$p(b_j) = \sum_{i=1}^{m} p(b_j/a_i)\,p(a_i). \tag{2}$$

The conditional probability of event $a_i$ is the ratio of the probability of the joint occurrence of both events to the probability of the outcome, or in symbols

$$p(a_i/b_j) = \frac{p(a_i b_j)}{p(b_j)}. \tag{3}$$

Substituting (1) and (2) into (3),

$$p(a_i/b_j) = \frac{p(b_j/a_i)\,p(a_i)}{\sum_{r=1}^{m} p(b_j/a_r)\,p(a_r)}, \tag{4}$$

which is Bayes's rule for the probability of causes³ if we identify the event $a_i$ with the cause and outcome $b_j$ with the effect.

² W. Feller, An Introduction to Probability Theory and Its Applications, Second Edition, John Wiley & Sons, Inc., Ch. V, 1957.

³ loc. cit.

Using the given values we can calculate the conditional probability of $a_i$, which in the context of Bayes's rule is known as the aposteriori probability of event $a_i$, as contrasted with $p(a_i)$, the apriori probability of event $a_i$. Displayed in tabular form,

                  p(a_1/b_j)    p(a_2/b_j)
   b_1 (heads)       2/3           1/3
   b_2 (tails)       2/5           3/5              (5)

Thus after one toss of the nickel, the aposteriori probability of having chosen the fair nickel is 2/3 if we had observed heads, and 2/5 if we had observed tails. If the sequence, selection and toss, was repeated a large number of times, the aposteriori probabilities represent the frequency with which the observer would be correct if he associated a given choice with a given outcome.
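The one-toss calculation can be checked mechanically. The following is a minimal sketch, assuming the probabilities of the nickel example (a fair nickel and a biased nickel with a heads probability of 1/4, each chosen with probability 1/2); the names `prior`, `likelihood` and `posterior` are illustrative and do not come from the original text.

```python
from fractions import Fraction

# A priori probabilities p(a_i): index 0 is the fair nickel, 1 the biased one.
prior = [Fraction(1, 2), Fraction(1, 2)]
# Conditional probabilities p(b_j/a_i) as likelihood[i][j]; j = 0 heads, 1 tails.
likelihood = [
    [Fraction(1, 2), Fraction(1, 2)],  # fair nickel
    [Fraction(1, 4), Fraction(3, 4)],  # biased nickel
]


def posterior(prior, likelihood, j):
    """Return [p(a_1/b_j), p(a_2/b_j)] for the observed outcome index j."""
    joint = [likelihood[i][j] * prior[i] for i in range(len(prior))]  # eq. (1)
    p_b = sum(joint)                                                  # eq. (2)
    return [x / p_b for x in joint]                                   # eq. (3)


after_heads = posterior(prior, likelihood, 0)
after_tails = posterior(prior, likelihood, 1)
# The fair-nickel entries are 2/3 after heads and 2/5 after tails, as in the text.
print(after_heads, after_tails)
```

Exact fractions are used so that the results can be compared directly with the values quoted in the text rather than with rounded decimals.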

We will now toss the coin for the second time. Defining the joint outcome $b_j b_k$, $j, k = 1, 2$, as the pair of observations made in the two tosses, we can extend (4) to

$$p(a_i/b_j b_k) = \frac{p(b_j b_k/a_i)\,p(a_i)}{\sum_{r=1}^{m} p(b_j b_k/a_r)\,p(a_r)}. \tag{6}$$

The conditional probability of the outcome of a single toss remains the same once a coin is chosen, therefore

$$p(b_k/a_i) = p(b_j/a_i) \quad \text{for } j = k. \tag{7}$$

Since the tosses are independent,

$$p(b_j b_k/a_i) = p(b_j/a_i)\,p(b_k/a_i). \tag{8}$$

Substituting (8) into (6),

$$p(a_i/b_j b_k) = \frac{p(b_j/a_i)\,p(b_k/a_i)\,p(a_i)}{\sum_{r=1}^{m} p(b_j/a_r)\,p(b_k/a_r)\,p(a_r)}. \tag{9}$$

We are now in a position to calculate the aposteriori probability of event $a_i$ after two observations. Again using the given values and exhibiting the results in a table,
   b_j b_k
   j   k       p(a_1/b_j b_k)    p(a_2/b_j b_k)
   1   1            4/5               1/5
   1   2            4/7               3/7
   2   1            4/7               3/7
   2   2            4/13              9/13            (10)

Thus if after two tosses we had observed heads-heads, the aposteriori probability that the coin is fair would be 4/5.

Notice that for each additional toss the size of the table required to describe the possible outcomes is doubled. If instead of this almost trivial example we had set m and n equal to 50, the table required to describe the situation would be enormous. It would be desirable to have a simpler way of presenting this data.
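The two-toss table and its growth with the number of tosses can be reproduced with a short sketch. It assumes the nickel probabilities of the example and independent tosses; the helper names `posterior_after` and `table` are hypothetical.

```python
from fractions import Fraction
from itertools import product

prior = [Fraction(1, 2), Fraction(1, 2)]          # p(a_1), p(a_2)
likelihood = [
    [Fraction(1, 2), Fraction(1, 2)],             # fair nickel: heads, tails
    [Fraction(1, 4), Fraction(3, 4)],             # biased nickel: heads, tails
]


def posterior_after(outcomes):
    """Return [p(a_1/seq), p(a_2/seq)] for a tuple of outcome indices."""
    weights = []
    for i, p_a in enumerate(prior):
        w = p_a
        for j in outcomes:
            w *= likelihood[i][j]                 # independent tosses
        weights.append(w)
    total = sum(weights)                          # probability of the sequence
    return [w / total for w in weights]


def table(tosses):
    """One row per possible outcome sequence of the given length."""
    return {seq: posterior_after(seq) for seq in product(range(2), repeat=tosses)}


two_toss = table(2)
for seq, post in two_toss.items():
    print([j + 1 for j in seq], post[0], post[1])

# Number of rows needed for 1 to 5 tosses: the table doubles with each toss.
rows_per_toss = [len(table(t)) for t in range(1, 6)]
print(rows_per_toss)
```

With two outcomes per toss the row count doubles at every step, which is the practical reason a more compact description of the data is sought.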

The denominator in (9) can be written in a manner analogous to (2) as $p(b_j b_k)$. Multiplying (9) by

$$\frac{\sum_{r=1}^{m} p(b_j/a_r)\,p(a_r)}{\sum_{r=1}^{m} p(b_j/a_r)\,p(a_r)} = 1$$

and rearranging,

$$p(a_i/b_j b_k) = \frac{p(b_k/a_i) \sum_{r=1}^{m} p(b_j/a_r)\,p(a_r)}{\sum_{r=1}^{m} p(b_j/a_r)\,p(b_k/a_r)\,p(a_r)} \cdot \frac{p(b_j/a_i)}{\sum_{r=1}^{m} p(b_j/a_r)\,p(a_r)} \cdot p(a_i). \tag{11}$$

Simplifying the summations,

$$p(a_i/b_j b_k) = \frac{p(b_k/a_i)\,p(b_j)}{p(b_j b_k)} \cdot \frac{p(b_j/a_i)}{p(b_j)} \cdot p(a_i). \tag{12}$$



Notice that the expression for the aposteriori probability of event $a_i$ after the first observation, (4), appears on the right side of (11) and is multiplied by a fraction whose value is determined by the outcome of the second observation. Thus we can say that the aposteriori probability of event $a_i$ after the first observation becomes the apriori probability of event $a_i$ before the second observation. If we represent the fraction $\frac{p(b_k/a_i)\,p(b_j)}{p(b_j b_k)}$ in (12) by $F_2$, and the fraction $\frac{p(b_j/a_i)}{p(b_j)}$ by $F_1$, we can write (3) as

$$p(a_i/b_j) = F_1\,p(a_i) \tag{13}$$

and (12) as

$$p(a_i/b_j b_k) = F_2 F_1\,p(a_i). \tag{14}$$

It is apparent that this could be extended to any number of observations. The aposteriori probability after, say, n observations is the product of n F factors and the apriori probability for the first observation. This, then, is the pattern which we will use.
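The product-of-factors pattern can be illustrated numerically. The sketch below assumes the nickel probabilities of the example; the general rule used for the r-th factor (the conditional probability of the new outcome times the ratio of the outcome-sequence probabilities before and after it) is an extrapolation of the two-observation case and is an assumption of this sketch, as are the function names.

```python
from fractions import Fraction

prior = [Fraction(1, 2), Fraction(1, 2)]
likelihood = [
    [Fraction(1, 2), Fraction(1, 2)],   # fair nickel: heads, tails
    [Fraction(1, 4), Fraction(3, 4)],   # biased nickel: heads, tails
]


def p_outcomes(outcomes):
    """Probability of an outcome sequence, summed over both events."""
    total = Fraction(0)
    for i, p_a in enumerate(prior):
        term = p_a
        for j in outcomes:
            term *= likelihood[i][j]
        total += term
    return total                         # equals 1 for the empty sequence


def f_factors(i, outcomes):
    """Return [F_1, F_2, ...] for event a_i and the observed outcome indices."""
    factors = []
    for r, j in enumerate(outcomes):
        seen = tuple(outcomes[:r])
        # F = p(new outcome / a_i) * p(earlier outcomes) / p(earlier and new)
        factors.append(likelihood[i][j] * p_outcomes(seen) / p_outcomes(seen + (j,)))
    return factors


factors = f_factors(0, (0, 0))           # fair nickel, heads then heads
posterior = prior[0]
for f in factors:
    posterior *= f                       # product of F factors times the prior
print(factors, posterior)
```

For heads-heads and the fair nickel the two factors are 4/3 and 6/5, and their product with the apriori probability 1/2 gives 4/5, the heads-heads value quoted in the text.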

We will define information as a statistic which measures the change in an observer's belief as to which event $a_i$, from a set of mutually exclusive and exhaustive events $A$, is the cause of outcome $b_j$ from a mutually exclusive and exhaustive set of outcomes $B$, where $\sum_i p(a_i) = 1$, $\sum_j p(b_j) = 1$, and $0 < p(a_i),\ p(b_j) < 1$. We require the following mathematical properties of this statistic: a. additivity, and b. dependence on apriori and aposteriori probabilities. Additivity requires that the total amount of information obtained from a sequence of observations is the sum of the amounts obtained from the individual observations.

The F factors defined above do relate to apriori and aposteriori probabilities. The requirement of additivity can be met by use of the function $\log F$, for

$$\log(F_1 F_2 \cdots F_n) = \log(F_1) + \log(F_2) + \cdots + \log(F_n). \tag{15}$$

Shannon [11, 12] uses base two logarithms, defining the unit of information as a bit, a contraction of binary digit. One bit of information corresponds to being informed of the outcome of a binary equally likely selection, a unit which is convenient since a relay or flip-flop circuit can store one bit of information. Kullback [7] uses natural logarithms since his work involves integration and differentiation, and others have used base 10.
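Additivity and the choice of logarithm base are easy to check numerically. The sketch below is illustrative only: the factors 4/3 and 6/5 are the values the nickel example yields for the fair nickel after two heads (derived from the probabilities of Section 2, not quoted in the text), and the function name `information` is hypothetical.

```python
import math


def information(f, base=2.0):
    """Information log F carried by one F factor, in units set by the base."""
    return math.log(f, base)


f1, f2 = 4 / 3, 6 / 5                      # assumed F factors (fair nickel, heads-heads)

total_bits = information(f1 * f2)          # log of the product
sum_bits = information(f1) + information(f2)   # sum of the logs

# One bit: being told the outcome of an equally likely binary selection.
one_bit = -math.log2(0.5)

# Changing the base only rescales the unit: natural-log units = bits * ln 2.
total_natural = information(f1 * f2, math.e)
print(total_bits, sum_bits, one_bit, total_natural)
```

The base therefore fixes only the unit of measurement; the additive structure is the same in every base.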
3. Uncertainty



We have indicated that a measure of the total change in belief of an observer is the sum of the logarithms of the F factors. Let us now look at the information obtained from the first observation of a sequence, designated as $I_{a_i b_j}$. From the definition of $F_1$,

$$I_{a_i b_j} = \log \frac{p(b_j/a_i)}{p(b_j)}, \tag{16}$$

and from (1) and (3) we see that we can also represent $I_{a_i b_j}$ as

$$I_{a_i b_j} = \log \frac{p(a_i/b_j)}{p(a_i)} \tag{17}$$

and

$$I_{a_i b_j} = \log \frac{p(a_i b_j)}{p(a_i)\,p(b_j)}. \tag{18}$$



Confining our attention to (17) for the moment, if $p(a_i/b_j)$ is greater than $p(a_i)$, the information is positive; if less than $p(a_i)$, the information is negative. Positive information corresponds to an increase in certainty. If $p(a_i/b_j)$ was equal to one, we would be certain that $a_i$ was the cause of our observed outcome. We will define the uncertainty of event $a_i$ as the value of $I_{a_i b_j}$ when $p(a_i/b_j) = 1$; thus the uncertainty of event $a_i$ is $-\log p(a_i)$. This is the maximum amount of information which can be obtained concerning $a_i$ in one observation. We are more concerned, however, with the entire set $A$, so let us first obtain the average uncertainty, which we will denote by $H(A)$:

$$H(A) = -\sum_i p(a_i) \log p(a_i). \tag{19}$$

$H(A)$ is descriptive of set $A$ or of any other set with the same number of events and the same probability distribution. We can also speak of the average uncertainty of observation set $B$, and of the event-outcome pair set $AB$, as

$$H(B) = -\sum_j p(b_j) \log p(b_j) \tag{20}$$

and

$$H(AB) = -\sum_i \sum_j p(a_i b_j) \log p(a_i b_j). \tag{21}$$
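The three average uncertainties can be evaluated for the nickel example. This is a minimal sketch in bits, assuming the example's probabilities; the function name `uncertainty` and the variable names are illustrative, and terms with zero probability are skipped by convention.

```python
import math


def uncertainty(probs, base=2.0):
    """Average uncertainty -sum p log p; zero-probability terms contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


prior = [0.5, 0.5]                               # p(a_1), p(a_2)
likelihood = [[0.5, 0.5], [0.25, 0.75]]          # p(b_j/a_i)

# Joint probabilities p(a_i b_j) and outcome probabilities p(b_j).
joint = [[prior[i] * likelihood[i][j] for j in range(2)] for i in range(2)]
p_b = [sum(joint[i][j] for i in range(2)) for j in range(2)]

h_a = uncertainty(prior)                         # average uncertainty of A
h_b = uncertainty(p_b)                           # average uncertainty of B
h_ab = uncertainty([p for row in joint for p in row])   # uncertainty of AB
print(h_a, h_b, h_ab)
```

Here the choice between two equally likely nickels carries one bit of average uncertainty, while the pair set carries less than the two separate uncertainties added together.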

Consider now the average information obtained on a particular event, $a_i$, if we average over all possible outcomes of the first observation. We will designate this average as $I_{a_i B_1}$:

$$I_{a_i B_1} = \sum_j p(b_j/a_i)\, I_{a_i b_j}. \tag{22}$$

This average is obviously equal to zero when $p(b_j/a_i)$ is equal to $p(b_j)$; recalling from (2) that $p(b_j)$ is equal to $\sum_i p(b_j/a_i)\,p(a_i)$, we see that this is only possible when $p(a_i)$ is equal to unity, or when all the $p(b_j/a_i)$ are equal for a given $j$. In the first case we would obtain no information since $a_i$ is the only event which can occur, and in the second case we would obtain no information since $b_j$ is independent of $a_i$. We will show that in all other cases $I_{a_i B_1}$ is greater than zero. An inequality from Feinstein [^] can be written as

$$\sum_i q_i \log p_i \le \sum_i q_i \log q_i \tag{23}$$

for $p_i \ge 0$, $q_i \ge 0$, $\sum_i q_i = \sum_i p_i = 1$, with equality only when $p_i = q_i$ for all $i$. Separating the logarithm in (22),

$$I_{a_i B_1} = \sum_j p(b_j/a_i) \log p(b_j/a_i) - \sum_j p(b_j/a_i) \log p(b_j), \tag{24}$$

and identifying the $p(b_j/a_i)$ with the $q$'s and the $p(b_j)$ with the $p$'s of (23), the sum now running over $j$, we see that the second sum on the right is always less than or equal to the first. Now since the logarithms of numbers less than one are negative, both sums are negative, and the difference is always greater than or equal to zero. Therefore the average information from one observation on a single event is a positive quantity corresponding to a decrease in uncertainty, or is zero corresponding to no change. Thus

$$I_{a_i B_1} \ge 0. \tag{25}$$
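A numerical check makes the distinction between the information from a single outcome and its average concrete. The sketch assumes the nickel probabilities of Section 2 and works in bits; `info` and `average_info` are hypothetical names.

```python
import math

prior = [0.5, 0.5]                               # p(a_1), p(a_2)
likelihood = [[0.5, 0.5], [0.25, 0.75]]          # p(b_j/a_i)
p_b = [sum(prior[i] * likelihood[i][j] for i in range(2)) for j in range(2)]


def info(i, j):
    """Information about event a_i from the single outcome b_j, in bits."""
    return math.log2(likelihood[i][j] / p_b[j])


def average_info(i):
    """Average of info(i, j) over the outcomes, weighted by p(b_j/a_i)."""
    return sum(likelihood[i][j] * info(i, j) for j in range(2))


single = [[info(i, j) for j in range(2)] for i in range(2)]
averages = [average_info(i) for i in range(2)]
print(single)
print(averages)
```

Observing tails lowers the belief that the fair nickel was chosen, so that single-outcome value is negative, yet the average over outcomes is positive for both nickels.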

We will now show that the average information obtained from a sequence of observations on the same event cannot exceed the uncertainty of the event. From (3) and (12) we can write $F_2$ as

$$F_2 = \frac{p(a_i/b_j b_k)}{p(a_i/b_j)} \tag{26}$$

and the average information obtained from the second observation as

$$I_{a_i B_2} = \sum_j \sum_k p(b_j/a_i)\,p(b_k/a_i) \log \frac{p(a_i/b_j b_k)}{p(a_i/b_j)}. \tag{27}$$

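By an argument identical to that used to prove (25) this average is also greater than or equal to zero. Separating the logarithms in $I_{a_i B_1}$ and $I_{a_i B_2}$ we can write the sum as

$$\begin{aligned} I_{a_i B_1} + I_{a_i B_2} = {} & \sum_j p(b_j/a_i) \log p(a_i/b_j) - \sum_j p(b_j/a_i) \log p(a_i) \\ & + \sum_j \sum_k p(b_k/a_i)\,p(b_j/a_i) \log p(a_i/b_j b_k) \\ & - \sum_j \sum_k p(b_k/a_i)\,p(b_j/a_i) \log p(a_i/b_j). \end{aligned} \tag{28}$$

If we now sum the last term on the right over $k$, we see that the result is equal in magnitude to the first term and opposite in sign, thus both terms cancel out, leaving us

$$I_{a_i B_1} + I_{a_i B_2} = \sum_j \sum_k p(b_j/a_i)\,p(b_k/a_i) \log p(a_i/b_j b_k) - \sum_j p(b_j/a_i) \log p(a_i). \tag{29}$$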

Similarly, in a sequence of say n observations, the first term in the expression for the average information from one observation in the sequence will cancel out with the second term in the expression for the average information from the next observation in the sequence. The sum will therefore consist of two terms such as

$$\sum_{r=1}^{n} I_{a_i B_r} = \sum_j \sum_k \cdots \sum_m p(b_j b_k \cdots b_m/a_i) \log p(a_i/b_j b_k \cdots b_m) - \sum_j p(b_j/a_i) \log p(a_i), \tag{30}$$

where $b_j$ is the first outcome and $b_m$ the last outcome. Now the maximum value which the first term on the right can attain is zero, corresponding to $p(a_i/b_j b_k \cdots b_m) = 1$, and the second term is just the uncertainty of event $a_i$. Thus

$$\max \sum_{r=1}^{n} I_{a_i B_r} = -\log p(a_i). \tag{31}$$
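The bound can be observed numerically for the nickel example. The sketch below, in bits, evaluates the two-term form of the accumulated average information for the fair nickel and 1 to 8 tosses; the probabilities are those of Section 2, and the function names are hypothetical.

```python
import math
from itertools import product

prior = [0.5, 0.5]                               # p(a_1), p(a_2)
likelihood = [[0.5, 0.5], [0.25, 0.75]]          # p(b_j/a_i)


def seq_prob(i, seq):
    """p(sequence / a_i) for independent tosses."""
    p = 1.0
    for j in seq:
        p *= likelihood[i][j]
    return p


def total_average_info(i, n):
    """Accumulated average information on a_i after n tosses, in bits."""
    first = 0.0
    for seq in product(range(2), repeat=n):
        given = [seq_prob(r, seq) for r in range(2)]
        # a posteriori probability of a_i given the whole sequence
        post = given[i] * prior[i] / sum(g * p for g, p in zip(given, prior))
        first += given[i] * math.log2(post)
    # the second term reduces to -log p(a_i) because the weights sum to one
    return first - math.log2(prior[i])


totals = [total_average_info(0, n) for n in range(1, 9)]
bound = -math.log2(prior[0])                     # uncertainty of the event
print(totals, bound)
```

The totals rise with each additional toss but stay below the one bit of uncertainty attached to the choice of nickel.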



Returning to the case of a single observation, let us average $I_{a_i B_1}$ over all events $a_i$ and see what the overall average, that is, over both events and outcomes, looks like. By (18) we can write (22) as

$$I_{a_i B_1} = \sum_j p(b_j/a_i) \log \frac{p(a_i b_j)}{p(a_i)\,p(b_j)}. \tag{32}$$

The average of (32) over all events, designated by $I_1$, is

$$I_1 = \sum_i \sum_j p(a_i)\,p(b_j/a_i) \log \frac{p(a_i b_j)}{p(a_i)\,p(b_j)}. \tag{33}$$

By (1) and (3) and separating the logarithms,

$$I_1 = \sum_i \sum_j p(a_i b_j) \left[ \log p(a_i b_j) - \log p(a_i) - \log p(b_j) \right]. \tag{34}$$

By (19), (20) and (21) we see that

$$I_1 = -H(AB) + H(A) + H(B). \tag{35}$$

Since we have established that $I_{a_i B_1}$ is nonnegative, $I_1$ must also be nonnegative. This implies that

$$H(AB) \le H(A) + H(B), \tag{36}$$

with equality when $A$ and $B$ are independent, or one or both consist of one event with probability one. We can also write (33) in the form

$$I_1 = \sum_i p(a_i) \left[ \sum_j p(b_j/a_i) \log p(b_j/a_i) - \sum_j p(b_j/a_i) \log p(b_j) \right]. \tag{37}$$



The first term inside the brackets is similar in form to (19), (20) and (21), and we will define the uncertainty of outcome set $B$, given a specific event $a_i$, as

$$H(B/a_i) = -\sum_j p(b_j/a_i) \log p(b_j/a_i). \tag{38}$$

Rearranging the right side of (37), and substituting (38),

$$I_1 = -\sum_i p(a_i)\,H(B/a_i) - \sum_i \sum_j p(a_i b_j) \log p(b_j). \tag{39}$$

Now the first term on the right is the negative of the average conditional uncertainty of $B$ and the second term is the uncertainty of $B$; thus, denoting the average conditional uncertainty by $H(B/A)$,

$$I_1 = -H(B/A) + H(B). \tag{40}$$

Since $I_1$ is nonnegative,

$$H(B/A) \le H(B), \tag{41}$$

with equality as in (36).

(16) and (17) are symmetric expressions in $a_i$ and $b_j$, so an equivalent averaging of (17) would yield

$$I_1 = -H(A/B) + H(A) \tag{42}$$

and

$$H(A/B) \le H(A). \tag{43}$$
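The relations between the overall average information and the various uncertainties can be verified for the nickel example. The sketch below works in bits with the probabilities of Section 2; the variable `i_1` stands for the overall average information over events and outcomes, and all names are illustrative.

```python
import math


def uncertainty(probs):
    """Average uncertainty in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


prior = [0.5, 0.5]                               # p(a_1), p(a_2)
likelihood = [[0.5, 0.5], [0.25, 0.75]]          # p(b_j/a_i)
joint = [[prior[i] * likelihood[i][j] for j in range(2)] for i in range(2)]
p_b = [joint[0][j] + joint[1][j] for j in range(2)]

h_a = uncertainty(prior)
h_b = uncertainty(p_b)
h_ab = uncertainty([p for row in joint for p in row])

# Average conditional uncertainty of B: sum over events of p(a_i) H(B/a_i).
h_b_given_a = sum(prior[i] * uncertainty(likelihood[i]) for i in range(2))
# Average conditional uncertainty of A, from p(a_i/b_j) = p(a_i b_j) / p(b_j).
h_a_given_b = sum(
    p_b[j] * uncertainty([joint[i][j] / p_b[j] for i in range(2)])
    for j in range(2)
)
# Overall average of log[p(a_i b_j) / (p(a_i) p(b_j))] over events and outcomes.
i_1 = sum(
    joint[i][j] * math.log2(joint[i][j] / (prior[i] * p_b[j]))
    for i in range(2)
    for j in range(2)
)
print(i_1, h_a + h_b - h_ab, h_b - h_b_given_a, h_a - h_a_given_b)
```

All four printed quantities agree to rounding error, and the common value is small because a single toss says little about which nickel was chosen.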
4. Properties of the Uncertainty Statistic

A number of properties were obtained in the previous section, namely:

$$H(AB) \le H(A) + H(B), \tag{36}$$

$$H(B/A) \le H(B), \tag{41}$$

and

$$H(A/B) \le H(A). \tag{43}$$



We will now consider this function in its own right and obtain some additional properties. Subtracting (40) from (35) we find that

$$H(AB) = H(A) + H(B/A), \tag{44}$$

and subtracting (42) from (35),

$$H(AB) = H(B) + H(A/B). \tag{45}$$

Since $H$ is of the form $-\sum_i p_i \log p_i$, we can represent it as $H(p_1, p_2, \ldots, p_m)$. By the property of the logarithm, $\lim_{x \to 0} x \log x = 0$, so that

$$H(p_1, p_2, \ldots, p_m, 0) = H(p_1, p_2, \ldots, p_m), \tag{46}$$

which indicates that adding an impossible event to a set does not change the average uncertainty.

Let us now determine the probability distribution for which the $H$ function takes on its maximum value. Using the method of Lagrange multipliers⁴ and writing $H$ as a function of all its arguments, we form

$$W = H(p_1, p_2, \ldots, p_m) + \lambda \left[ \sum_i p_i - 1 \right] \tag{47}$$

or

$$W = -\sum_i p_i \log p_i + \lambda \left[ \sum_i p_i - 1 \right]. \tag{48}$$

Taking partial derivatives of $W$ with respect to $p_i$,

$$\frac{\partial W}{\partial p_i} = -p_i \frac{\partial}{\partial p_i} \log p_i - \log p_i + \lambda \quad \text{for } i = 1, 2, \ldots, m. \tag{49}$$

With natural logarithms the first term on the right is equal to $-1$. Setting (49) equal to zero to obtain the extreme point and solving for $\log p_i$,

$$\log p_i = \lambda - 1, \tag{50}$$

$$p_i = \exp(\lambda - 1). \tag{51}$$

Summing both sides of (51) on $i$ and solving for $\exp(\lambda - 1)$,

$$\exp(\lambda - 1) = 1/m. \tag{52}$$

Substituting (52) into (51),

$$p_i = 1/m. \tag{53}$$

Thus the probability distribution for which the average uncertainty is the maximum is the uniform distribution, or

$$H(p_1, p_2, \ldots, p_m) \le H(1/m, 1/m, \ldots, 1/m), \tag{54}$$

with equality when $p_i = 1/m$.

⁴ W. Kaplan, Advanced Calculus, Addison-Wesley Publishing Co., pp. 128-129, 1953.

If we have two sets, one with $m$ events, the other with $m + 1$ events, the maximum average uncertainty of the smaller set is less than the maximum average uncertainty of the larger one. This can be seen by computing and noting the right side of (54) for both cases.

$$-\log \frac{1}{m} < -\log \frac{1}{m + 1} \quad \text{for } m \ge 1. \tag{55}$$

Thus

$$H\!\left(\frac{1}{m}, \frac{1}{m}, \ldots, \frac{1}{m}\right) < H\!\left(\frac{1}{m+1}, \frac{1}{m+1}, \ldots, \frac{1}{m+1}\right). \tag{56}$$
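Both the maximum at the uniform distribution and the growth of that maximum with the number of events can be spot-checked. The sketch below, in bits, draws random distributions with a fixed seed; the sample size, the seed and the set size of 5 are arbitrary choices made for illustration, and the names are hypothetical.

```python
import math
import random


def uncertainty(probs):
    """Average uncertainty in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def random_distribution(m, rng):
    """A random probability distribution over m events (illustrative only)."""
    weights = [rng.random() + 1e-9 for _ in range(m)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)                    # fixed seed so the run is repeatable
m = 5
uniform_h = uncertainty([1 / m] * m)      # uncertainty of the uniform distribution
samples = [uncertainty(random_distribution(m, rng)) for _ in range(1000)]

# Maximum average uncertainty for sets of 1 to 7 events.
max_by_size = [uncertainty([1 / k] * k) for k in range(1, 8)]
print(uniform_h, max(samples), max_by_size)
```

A random search of this kind is not a proof, but no sampled distribution exceeds the uniform value, and the uniform value itself increases with the size of the set.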



Consider the form of $H$ when one event, say $m$, is composed of two sub-events such that $p_m = q_1 + q_2$.

$$H(p_1, p_2, \ldots, p_{m-1}, q_1, q_2) = -\sum_{i=1}^{m-1} p_i \log p_i - q_1 \log q_1 - q_2 \log q_2 \tag{57}$$

and

$$H(p_1, p_2, \ldots, p_m) = -\sum_{i=1}^{m-1} p_i \log p_i - p_m \log p_m. \tag{58}$$

Subtracting (58) from (57) and combining the logarithms of the terms on the extreme right,

$$H(p_1, p_2, \ldots, p_{m-1}, q_1, q_2) = H(p_1, p_2, \ldots, p_m) + p_m\, H\!\left(\frac{q_1}{p_m}, \frac{q_2}{p_m}\right). \tag{59}$$
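The decomposition of an event into two sub-events can be checked with arbitrary numbers. In the sketch below the distribution (0.5, 0.3, 0.2) and the split of the last probability into 0.15 and 0.05 are illustrative values chosen for this check, not data from the text.

```python
import math


def uncertainty(probs):
    """Average uncertainty in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


p = [0.5, 0.3, 0.2]          # p_1, p_2, p_m (illustrative values)
q1, q2 = 0.15, 0.05          # sub-events of the last event: q1 + q2 = p_m

# Left side: uncertainty of the set with the last event split in two.
lhs = uncertainty(p[:-1] + [q1, q2])
# Right side: original uncertainty plus p_m times the uncertainty of the split.
rhs = uncertainty(p) + p[-1] * uncertainty([q1 / p[-1], q2 / p[-1]])
print(lhs, rhs)
```

The two sides agree to rounding error, which is the sense in which the uncertainty of a refined set is the original uncertainty plus a weighted uncertainty of the refinement.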



We had previously stated that $H$ was descriptive of its set of arguments but not that it was unique. Shannon [11, 12] shows that the only function continuous in $p_i$ and possessing properties (56) and (59) is $-K \sum_i p_i \log p_i$, where $K$ is an arbitrary positive constant. Khinchin [6] shows the same result using properties (44), (46) and (54).

For applications of this statistic the reader is referred to the bibliography.